remove_non_tensor_columns by garrett361 · Pull Request #831 · allenai/open-instruct

garrett361 · 2025-07-26T13:24:52Z

In #765, additional string metadata was added to the dataset. This conflicts with the DataCollatorForSeq2Seq collator function, which expects only tensor data, and makes finetune.py fail (when --packing False) with errors like:

Traceback (most recent call last):
  File "/proj/data-eng/swanand/sft_dpo/venv_sft_dpo/venv/venv_sft_dpo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 767, in convert_to_tensors
    tensor = as_tensor(value)
  File "/proj/data-eng/swanand/sft_dpo/venv_sft_dpo/venv/venv_sft_dpo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 729, in as_tensor
    return torch.tensor(value)
ValueError: too many dimensions 'str'

This error arises when trying to create a tensor from a list of strings, e.g. torch.tensor(["hello"]).

This PR adds a utility that filters non-tensor columns out of the dataset before it is used, and applies the filter in both the SFT and DPO scripts.
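For illustration only, the filtering idea can be sketched in plain Python as below. This is not the code from this PR: the names is_tensor_like and remove_non_tensor_columns_sketch, the "origin" key, and the list-of-dicts data layout are all assumptions made for the example, and every row is assumed to share the same keys. The real utility presumably operates on the dataset object used by the training scripts.

```python
from numbers import Number


def is_tensor_like(value):
    """Return True if value is a number or a (possibly nested) list of numbers.

    Hypothetical helper: strings and other objects are treated as
    non-tensor data, since a collator cannot convert them to a tensor.
    """
    if isinstance(value, Number):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_tensor_like(v) for v in value)
    return False


def remove_non_tensor_columns_sketch(rows):
    """Drop every column that holds non-tensor data in at least one row.

    `rows` is assumed to be a list of dicts that all share the same keys.
    """
    if not rows:
        return []
    keep = [k for k in rows[0] if all(is_tensor_like(r[k]) for r in rows)]
    return [{k: r[k] for k in keep} for r in rows]


# Example: "origin" is a made-up stand-in for a string metadata column.
rows = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3], "origin": "some_dataset"},
    {"input_ids": [4, 5], "labels": [4, 5], "origin": "another_dataset"},
]
cleaned = remove_non_tensor_columns_sketch(rows)
```

After filtering, only the numeric columns remain, so a collator that converts every column to a tensor no longer encounters string values.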

CC @hamishivi @jacob-morrison

The issue specifically is the addition of the DATASET_ORIGIN_KEY metadata here.

I believe this wrapper is only needed for finetune.py and dpo_cache_tune.py, and not for reward_modeling.py or reward_modeling_eval.py, but I have only tested finetune.py end to end with these fixes.

hamishivi · 2025-07-28T08:19:00Z

Thanks for noticing and for the PR. I believe this is fixed by #825 (which I just merged).

garrett361 · 2025-07-28T14:10:06Z

Yes, hadn't seen that one! Thanks.

remove_non_tensor_columns

a730b24

hamishivi closed this Jul 28, 2025

garrett361 wants to merge 1 commit into allenai:main from garrett361:upstream-fix